Skip to content

Conversation

@abcxff
Copy link
Contributor

@abcxff abcxff commented Nov 8, 2025

Fixes RVT-5269

@vercel
Copy link

vercel bot commented Nov 8, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 19, 2025 5:34pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 19, 2025 5:34pm
rivet-inspector Ignored Ignored Preview Nov 19, 2025 5:34pm
rivet-site Ignored Ignored Preview Nov 19, 2025 5:34pm

Copy link
Contributor Author

abcxff commented Nov 8, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title feat: pegboard serverless pool feat(pegboard): move pegboard serverless to wf Nov 8, 2025
@abcxff abcxff requested a review from MasterPtato November 8, 2025 01:25
@abcxff abcxff marked this pull request as ready for review November 8, 2025 01:25
@linear
Copy link

linear bot commented Nov 8, 2025

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 8, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3416

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3416

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3416

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3416

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3416

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3416

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3416

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3416

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3416

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3416

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3416

commit: 9db0849

@claude
Copy link

claude bot commented Nov 8, 2025

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR migrates pegboard serverless functionality from a separate package into the main pegboard package as workflows, and removes the unused internal package. The serverless runner lifecycle is now managed via the gasoline workflow engine.

Critical Issues

1. Corrupted Error Artifact Generation (BLOCKER)

The PR includes 88 malformed error artifact files with corrupted names containing autocomplete markers like raCompletionMarker and partial error codes.

Examples:

  • serverless_runner.invalid_configraCompletionMarker.json
  • seraCompletionMarkerrunner_config.invalid.json
  • serverless_runner.cnot_found.json

Root Cause: The error definitions in errors.rs are correct, but the artifact generation script has a bug creating incremental artifacts.

Action Required:

  • Delete all 86 malformed error artifact files
  • Fix the error artifact generation script
  • Regenerate only valid error artifacts

Code Quality Issues

2. Unnecessary field name in pool.rs:35
Line has redundant field initialization: config: config
Should be: config

3. Debug comment in runtime.rs:1
Leftover comment: runner wf see how signal fail handling

4. Activity Retry Strategy (runner.rs:115-116)
Using MAX values for timeout and retries is concerning. Infinite retries could cause zombie workflows. Is this intentional? Please add a comment explaining the rationale or set reasonable limits.

5. TODO about lifecycle state (runner.rs:229-234)
Known issue about lifecycle state management when connections close unexpectedly. Consider creating a GitHub issue to track this.

Positive Changes

Architecture Simplification

  • Consolidates serverless into main pegboard package
  • Reduces workspace complexity (removed 2 packages)
  • Improves code cohesion

Workflow-Based State Management

  • Provides durable state persistence
  • Better observability
  • Automatic retry and recovery

Security

  • Headers properly validated preventing injection attacks
  • Tokens handled securely

ActivityCtx Enhancements
Added useful methods: find_workflow, get_workflows, state accessor, update_workflow_tags

Test Coverage

No test files found for new serverless workflows. Consider adding:

  • Unit tests for serverless configuration parsing
  • Integration tests for pool scaling logic
  • Tests for graceful draining behavior

Recommendations

  1. CRITICAL: Fix error artifact generation bug
  2. HIGH: Document or fix infinite timeout/retry strategy
  3. MEDIUM: Add tests for serverless workflows
  4. LOW: Clean up debug comment and use shortened field init

Final Verdict

NEEDS CHANGES - Good architecture and code quality, but corrupted error artifacts must be fixed before merging. Once resolved, this is a solid refactoring.

Comment on lines 100 to 101
#[timeout = u64::MAX]
#[max_retries = usize::MAX]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about using this for outbound req retries, the issue is the backoff will never reset and will get longer and longer after each request.

Ideally we only have an increasing backoff for actual errors and not the request stopping normally or failing after drain.

This will require a loop with state tracking last retry and retry count and a resetting mechanism (see pb actor reschedule_actor fn for reference)

Tracking with https://linear.app/rivet-gg/issue/RVT-5274/add-backoff-to-pb-serverless

@claude
Copy link

claude bot commented Nov 12, 2025

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR successfully migrates the pegboard serverless autoscaler from a standalone service to a workflow-based architecture. The refactoring removes ~524 lines from pegboard-serverless package and integrates the functionality into the pegboard workflows using the Gasoline workflow engine.

Code Quality & Architecture

Strengths

  1. Clean Separation of Concerns: The new implementation splits responsibilities across three well-defined workflows (pool.rs, runner.rs, connection.rs)

  2. Improved Reliability: Moving to workflows provides durable state management, automatic retries, and better observability

  3. Package Consolidation: Removes two unused packages

  4. Error Handling: Proper structured errors using the custom RivetError system

  5. API Modernization: The bump autoscaler endpoint now uses proper signal-based workflow communication

Areas for Improvement

1. Critical: Unclosed Event Source on Early Exit (connection.rs:196-233)

When the stream ends early, the source EventSource is not properly closed before returning. This can lead to resource leaks.
Recommendation: Call source.close() before returning in the StreamEnded error case.

2. Potential Race Condition (connection.rs:241-262)

The code attempts to send RunnerDrainStarted signal inline with bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING(), but if it fails, relies on the workflow to send it durably.
Question: Is there a race where both the activity retry and workflow durable send could succeed?

3. TODO Comment About Lifecycle State (connection.rs:225-232)

Known issue where unexpected connection closures before receiving runner_id can cause lifecycle state problems.
Recommendation: Create tracking issue or implement timeout-based cleanup.

4. Gasoline API Extension

The PR adds signal(), find_workflow(), and get_workflows() methods to ActivityCtx. Activities sending signals breaks typical workflow orchestration patterns.
Question: Is this a deliberate architectural decision? What are the implications for workflow determinism?

5. Infinite Retry Configuration (connection.rs:72-73)

Activity has max_retries = usize::MAX and timeout = u64::MAX without documentation.
Recommendation: Add comment explaining why infinite retries are safe.

Security Considerations

Good: Comprehensive header validation, URL validation, proper base64 error handling

Concern: Admin token added to headers - ensure TLS-only and trusted endpoints. Response bodies in error messages (though properly limited to 512 bytes).

Performance

Good: Efficient streaming queries, concurrent workflow operations, graceful draining

Opportunity: Consider aggressive caching for high-frequency scaling, batch signaling for runner draining

Test Coverage

Missing: No new test files for workflow implementations.
Recommendation: Add unit tests for autoscaling logic, integration tests for state transitions, error handling tests, lifecycle tests

Summary

This is a well-architected refactoring that improves maintainability and reliability. Main concerns are resource cleanup and activities sending signals.

Recommendation: Request changes to address EventSource leak, then approve after review.

High Priority

  • Fix EventSource resource leak on early exit
  • Document lifecycle state TODO
  • Clarify design decision on activities sending signals

Medium Priority

  • Add test coverage
  • Document infinite retry strategy
  • Add runner count validation

Generated with Claude Code

@abcxff abcxff requested a review from MasterPtato November 12, 2025 23:20
@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Move Pegboard Serverless to Workflow

Overview

This PR successfully migrates the serverless connection management from a service-based approach to a workflow-based architecture. The refactoring removes the pegboard-serverless and internal packages and consolidates the logic into three new workflows within the pegboard package.

✅ Strengths

Architecture Improvements

  • Better separation of concerns: The three-workflow structure (pool → runner → connection) provides clear boundaries and responsibilities
  • Workflow-based lifecycle: Moving to workflows provides better durability, observability, and state management compared to the previous service approach
  • Proper cleanup: Removed dead code (internal package, unused operations) reducing maintenance burden

Code Quality

  • Good error handling: The code properly handles edge cases like missing runner configs, invalid status codes, and workflow not found scenarios (connection.rs:459-470)
  • Exponential backoff: Implements proper retry logic with configurable parameters and reset mechanisms (connection.rs:499-505)
  • Structured logging: Uses tracing correctly with structured fields rather than formatted strings

Configuration Management

  • New configuration options: Added sensible defaults for serverless backoff behavior in config/pegboard.rs:59-110
  • Validation: Proper validation of serverless config including URL parsing and header limits (ops/runner_config/upsert.rs:102-136)

⚠️ Issues & Concerns

Critical Issues

  1. Race condition in pool scaling (pool.rs:23-30, 43-56)

    // Remove completed runners
    state.runners.retain(|r| !completed_runners.contains(r));
    
    // Calculate drain/start counts
    let drain_count = state.runners.len().saturating_sub(desired_count);

    If runners complete between getting completed_runners and draining, the count calculations could be incorrect. Consider recalculating after removal.

  2. Unhandled signal errors (connection.rs:335-356)
    The code catches signal send failures and logs them, but returns success with send_drain_started: true. If the workflow-based signal also fails, the parent pool may never know the runner drained. Consider more robust error handling.

  3. Missing timeout on activity (connection.rs:369-379)
    is_runner_draining has no timeout, but fetches workflow state. If the database is slow, this could hang indefinitely. Consider adding a timeout.

  4. TODO: Critical lifecycle issue (connection.rs:319-327)

    // TODO:
    // For unexpected closes, we don't know if the runner connected
    // or not bc we can't correlate the runner id.
    //
    // Lifecycle state falls apart

    This is a significant concern. If the SSE stream closes unexpectedly before receiving the runner ID, the system cannot properly track the runner's lifecycle. This could lead to resource leaks or orphaned runners.

Security Concerns

  1. Header injection risk (connection.rs:225-258)
    While headers are validated during upsert (ops/runner_config/upsert.rs:109-136), the error handling silently filters out invalid headers at runtime:

    .flat_map(|(k, v)| {
        // NOTE: This will filter out invalid headers without warning
        Some((
            k.parse::<HeaderName>().ok()?,
            v.parse::<HeaderValue>().ok()?,
        ))
    })

    This could hide configuration errors. Consider logging warnings for filtered headers.

  2. Admin token exposure (connection.rs:216-223)
    The admin token is sent in headers to external serverless endpoints. Ensure these endpoints are trusted and connections are secure (HTTPS enforced).

Code Quality Issues

  1. Deprecated header still sent (connection.rs:251-255)

    // Deprecated
    (
        HeaderName::from_static("x-rivet-namespace-id"),
        HeaderValue::try_from(namespace.name)?,
    )

    If this is deprecated, consider removing it or documenting the deprecation timeline.

  2. Magic constants (connection.rs:23)

    const DRAIN_GRACE_PERIOD: Duration = Duration::from_secs(5);

    This should be configurable via the config system like other timeouts.

  3. Unbounded workflow state growth (pool.rs:14-16)

    struct LifecycleState {
        runners: Vec<Id>,
    }

    The runners vec grows without bounds. With autoscaling, this could become large. Consider pagination or a different tracking mechanism for large deployments.

Testing Concerns

  • No unit tests: No test files were found in the pegboard package for the new workflow code
  • No integration tests visible: The PR doesn't show test coverage for the new workflow interactions
  • Complex state machine: The connection retry logic and lifecycle management are complex but untested

Minor Issues

  1. Potential integer overflow (pool.rs:173-178)

    let desired_count = (runners_margin
        + (adjusted_desired_slots as u32).div_ceil(slots_per_runner))
    .max(min_runners)
    .min(max_runners)
    .try_into()?;

    The comment says these are in u32 range, but there's no validation. Consider explicit bounds checking.

  2. Unclear comment (connection.rs:339)

    // This is ok, because we only send DrainStarted once
    .bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING()

    The bypass is used, but the rationale could be clearer. What happens if it's sent multiple times?

  3. Inconsistent error handling (runner.rs:52-54)
    The runner workflow waits for the connection workflow to complete, but doesn't check if it succeeded or failed. Consider handling workflow errors.

📋 Recommendations

High Priority

  1. Add comprehensive tests for the new workflows, especially:

    • Connection retry logic and backoff behavior
    • Pool scaling edge cases
    • Drain signal handling
    • Error recovery scenarios
  2. Resolve the TODO at connection.rs:319-327 regarding lifecycle tracking for unexpected closes

  3. Add timeouts to all activities that make external calls or database queries

  4. Document the workflow interaction patterns and state machines

Medium Priority

  1. Make DRAIN_GRACE_PERIOD configurable
  2. Add warnings for filtered invalid headers
  3. Consider adding metrics/alerts for:
    • Connection retry counts
    • Failed signal deliveries
    • Unexpected stream closes

Low Priority

  1. Remove deprecated headers or document the timeline
  2. Add bounds checking for integer conversions
  3. Consider implementing pagination for large runner lists in pool state

🎯 Overall Assessment

This is a solid architectural improvement that moves serverless connection management to a more maintainable workflow-based approach. The code quality is generally good with proper error handling and structured logging. However, the lack of tests for such critical infrastructure code is concerning, and there are several edge cases (race conditions, signal failures, lifecycle tracking) that need addressing before this can be considered production-ready.

Recommendation: Request changes to address the critical issues, especially the lifecycle tracking TODO and test coverage, before merging.


Note: Review based on repository conventions in CLAUDE.md

@abcxff abcxff requested a review from MasterPtato November 19, 2025 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants